Measuring the Algorithmic Convergence of Random Forests via Bootstrap Extrapolation

Author

  • Miles E. Lopes
Abstract

When making predictions with a voting rule, a basic question arises: “What is the smallest number of votes needed to make a good prediction?” In the context of ensemble classifiers, such as Random Forests or Bagging, this question represents a tradeoff between computational cost and statistical performance. Namely, by paying a larger computational price for more classifiers, the prediction error of the ensemble tends to improve and become more stable. Conversely, by using fewer classifiers and tolerating some variability in accuracy, it is possible to speed up the tasks of training the ensemble and making new predictions. In this paper, we propose a bootstrap method to quantify this tradeoff for the methods of Bagging and Random Forests. To be specific, suppose the training dataset is fixed, and let the random variable Err_t denote the prediction error of a randomly generated ensemble of t = 1, 2, ... classifiers. (The randomness of Err_t comes only from the algorithmic randomness of the ensemble.) Working under a “first order model” of Random Forests, we prove that the centered law of Err_t can be consistently estimated via our proposed method as t → ∞. As a consequence, this result offers practitioners a guideline for choosing the smallest number of base classifiers needed to ensure that the algorithmic fluctuations are negligible, e.g., var(Err_t) less than a given threshold.
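The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes binary labels, a held-out set on which each classifier's votes have been recorded, a bootstrap that resamples classifiers with replacement, and a 1/t scaling of var(Err_t) used to extrapolate from a small ensemble to a larger one. The function names, the tie-breaking rule, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def majority_vote_error(votes, labels):
    """Error rate of the majority vote for binary labels in {0, 1}.

    votes[i][j] is classifier i's predicted label for held-out point j.
    Ties are broken in favour of class 0 (an arbitrary choice).
    """
    t = len(votes)
    wrong = 0
    for j, label in enumerate(labels):
        ones = sum(votes[i][j] for i in range(t))
        pred = 1 if 2 * ones > t else 0
        wrong += pred != label
    return wrong / len(labels)


def bootstrap_error_variance(votes, labels, n_boot=200, seed=0):
    """Estimate var(Err_t) by resampling the t classifiers with replacement."""
    rng = random.Random(seed)
    t = len(votes)
    errs = []
    for _ in range(n_boot):
        resampled = [votes[rng.randrange(t)] for _ in range(t)]
        errs.append(majority_vote_error(resampled, labels))
    return statistics.pvariance(errs)


def extrapolate_variance(var_t0, t0, t):
    """Extrapolate a variance estimate from size t0 to size t.

    Assumes var(Err_t) shrinks like 1/t; this scaling is an assumption
    of the sketch, not a claim taken from the abstract.
    """
    return var_t0 * t0 / t


def smallest_ensemble_size(var_t0, t0, threshold):
    """Smallest t whose extrapolated variance is at most the threshold."""
    return max(1, math.ceil(var_t0 * t0 / threshold))


if __name__ == "__main__":
    # Synthetic stand-in for real held-out votes: 25 classifiers, each
    # independently correct with probability 0.7 on 200 points.
    rng = random.Random(1)
    labels = [rng.randrange(2) for _ in range(200)]
    votes = [
        [y if rng.random() < 0.7 else 1 - y for y in labels]
        for _ in range(25)
    ]
    var_25 = bootstrap_error_variance(votes, labels)
    print("bootstrap estimate of var(Err_25):", var_25)
    print("extrapolated var(Err_100):", extrapolate_variance(var_25, 25, 100))
    print("t for variance <= 1e-5:", smallest_ensemble_size(var_25, 25, 1e-5))
```

In practice the vote matrix would come from the individual trees of a trained ensemble evaluated on held-out data. The appeal of extrapolating is that the bootstrap only needs to be run on a small initial ensemble before deciding whether more classifiers are worth training.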

Similar resources

Random Forests for Big Data

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data but they also often include data streams and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based o...

Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we co...

Richardson Extrapolation and the Bootstrap

Simulation methods, in particular Efron's (1979) bootstrap, are being applied more and more widely in statistical inference. Given data (X_1, ..., X_n) distributed according to P belonging to a hypothesized model P, the basic goal is to estimate the distribution L_P of a function T_n(X_1, ..., X_n, P). The bootstrap presupposes the existence of an estimate P̂(X_1, ..., X_n) and consists of estimating L_P by the ...

The stability of feature selection and class prediction from ensemble tree classifiers

The bootstrap aggregating procedure at the core of ensemble tree classifiers reduces, in most cases, the variance of such models while offering good generalization capabilities. The average predictive performance of those ensembles is known to improve up to a certain point while increasing the ensemble size. The present work studies this convergence in contrast to the stability of the class pre...

Impact of Measuring Devices and Data Analysis on the Determination of Gas Membrane Properties

The time-lag method, using a gas permeation experiment, is currently the most popular method for determining membrane properties: the diffusivity coefficient and the permeability coefficient, from which the solubility coefficient can be calculated. In this investigation, the impact of systematic, random (noise), resolution and extrapolation errors associated with gas permeatio...

Publication date: 2015